This tutorial provides an approach to basic analysis for single-cell RNA sequencing data:
Cluster the cells
Visualization
Differentially expressed genes
Identify the cell types
Enrichment analysis
This tutorial refers to Seurat, a software(package) for analyzing single-cell RNA seq data in R.
Feel free to contact me if you have any questions. Email: feng_huijian@gibh.ac.cn Wechat: rick__sanchez__
All the anaylsis below relies on the R environment. Please download R and related software before the coures.
We select the tsinghua mirror to download R
Download R for Windows, and then click install R for the first time.Download R for (Mac) OS X, and then click R-3.6.1.pkg.Download and install Rstudio, and then select RStudio Desktop Open Source License FREE to download Windows Version RStudio 1.3.1056 - Windows 7+ (64-bit) or Mac Version RStudio 1.3.1056 - macOS 10.13+ (64-bit)
Install the R packages for the following analysis.
We also use the tsinghua mirror to download R packages for saving times and avoiding the unnecessary Error. Open the areadly installed Rstudio. Paste and execute the code elow in the Console panel:
# rcran
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
# install bioconductor library.
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
if (!requireNamespace("BiocManager"))
install.packages("BiocManager")
# bioconductor
options(useHTTPS=FALSE, BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
# A function: intalling the packages to R ->IPR
IPR <- function(packages,
sources = c("RCran", "Bioconductor", "Github")){
installedpackages <- installed.packages()[ ,"Package"]
existpackages <- intersect(packages, installedpackages)
if(length(existpackages) != 0){
print(paste0("Packages ", existpackages, " have been installed"))
}
newpackages <- setdiff(packages, installedpackages)
if(length(newpackages) != 0){
if(sources == "RCran"){
install.packages(newpackages)
}
if(sources == "Bioconductor"){
BiocManager::install(newpackages)
}
if(sources == "Github"){
devtools::install_github(newpackages)
}
}
}
# R CRAN packages
# rpkg <- c("dplyr","ggplot2","matrixStats","pheatmap","DT","data.table", "Seurat")
# IPR(packages = rpkg, sources = "RCran")
# Bioconductor packages
# biopkg <- c("clusterProfiler", "org.Mm.eg.db", "org.Hs.eg.db", "genefilter", "ReactomePA")
# IPR(packages = biopkg, sources = "Bioconductor")In this tutorial, we provide a dataset Peripheral Blood Mononuclear Cells(PBMC) freely available from 10X Genomics. There are ~5k single cells that were sequencing on the Illumina NextSeq500. You can download the data from the 10X Genomics.
DATASETS and click View all.Name, Your Email, Institution, Country. Click no for “receive update” and agree the “Privacy Policy”.Feature / cell matrix (filtered). Warning: This Object size is 41.21 MB.And then you should extract the compressed file into a directory, such as D:/Tutorial/scRNA_seq_tutorial/counts/filtered_feature_bc_matrix(this directory need to be created).
Then, we can check how the 10X single-cell RNA seq data is stored. we open the directoryD:/Tutorial/scRNA_seq_tutorial/counts/filtered_feature_bc_matrix, and we will find that three compressed files store their respective files:
We load the packages by calling library() at firt.
At first, We have to set a correct working directory.
Then we can load the data by calling Read10X(). This function can read the three files(barcodes.tsv.gz, features.tsv.gz, matrix.mtx.gz) in the directory at once.
Then we can setup the seurat object by calling CreateSeuratObject().
We highly encourage querying the detail of a function by calling help(function_name) or?function_name, such as help(CreateSeuratObject).
## Example for set working directory
## setwd("D:/singlecell_pratice")
## loading the data
adata <- Read10X("./counts/filtered_feature_bc_matrix")
## create the seurat object
adata <- CreateSeuratObject(counts = adata,
project = "sc_tutorial",
min.cells = 3,
min.features = 200)
# parameter counts : input the "raws gene expression marix"
# parameter project : input the "name of our project"
# parameter min.cells : exclude some "genes" which are expressed in less than 3 cells
# parameter min.features : exclude some "cells" which are expressing less then 200 genes
## view the data
adata## An object of class Seurat
## 19037 features across 5100 samples within 1 assay
## Active assay: RNA (19037 features)
These workflow represent the selection and filtration of cells based on QC metrics, data normalization and data scaling, and the detection of highly variable genes.
Seurat provides simple QC metrics and allow you to filter cells based on user-defined criterion. A few QC metrics include commonly:
we call PercentageFeatureSet to caculate the percentage of mitochondrial genes. Mitochondrial genes are a set genes which can be captured by matching the prefix MT-.
## caculate the percentage of mitochondrial genes
adata[["percent.mt"]] <- PercentageFeatureSet(adata,
pattern = "^MT-")
# The [[ operator can add columns to object metadata. You can use "adata[[]]" to check object metadata.
# parameter pattern : match regular expression. "^MT-" means that the gene with prefix "MT-"
## function head() allow us to check the first 6 rows of data frame
head(adata[[]])## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCCAAGACAGCTG sc_tutorial 8384 2550 9.148378
## AAACCCAAGTTAACGA sc_tutorial 11768 3017 6.296737
## AAACCCACAGTCGCAC sc_tutorial 6947 2033 5.570750
## AAACCCAGTAGCTTGT sc_tutorial 9354 2193 7.750695
## AAACCCATCTGCCCTA sc_tutorial 8970 2630 7.781494
## AAACGAAAGCAATTAG sc_tutorial 9291 2286 11.634916
In this data, we visualize the QC metric and use these filter cells.
nCount_RNA means that the total number of reads per cells map to the genome. nFeature_RNA means that the total number of genes expressed per cell. percent.mt means that the percentage of mitochondrial genes per cell.
## Check the distribution of the three QC metric
VlnPlot(adata,
features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3)## parameter
nfeature_min = 200
nfeature_max = 5000
percent_mt = 20
## Check the distribution of the three QC metric
plot1 <- FeatureScatter(adata, feature1 = "nCount_RNA", feature2 = "percent.mt") +
geom_hline(yintercept = percent_mt, linetype="dotted", size = 1, colour = "#ae0404")
plot2 <- FeatureScatter(adata, feature1 = "nCount_RNA", feature2 = "nFeature_RNA") +
geom_hline(yintercept = nfeature_min, linetype="dotted", size = 1, colour = "#04ae59")+
geom_hline(yintercept = nfeature_max, linetype="dotted", size = 1, colour = "#04ae59")
## plot
plot_grid(plot1, plot2, nrow = 2)We can find that samples is 4591 now, before QC samples is 5100.
## QC
# 200< nFeature_RNA < 2500 & percent.mt < 5%
adata <- subset(adata,
subset = nFeature_RNA < nfeature_max & nFeature_RNA > nfeature_min & percent.mt < percent_mt)
## check data
adata## An object of class Seurat
## 19037 features across 4591 samples within 1 assay
## Active assay: RNA (19037 features)
After moving unwanted cells, the next step is to normalize the data. We employ a global-scaling normalizaiton method LogNormalize that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a size factor(10,000 by default), and log-transforms the result. Normalized values are stored in the adata[["RNA"]]@data which is another sparse matrix.
We next caculate a subset of geens that exhibit high cell-to-cell variation in the data(they are highly expressed in some cells,and lowly expressed in others). The high varialble genes can highlight biological signal in single-cell data. we can find high variable genes(HVGs) by calling FindVariableFeature() function. By default, we return top 3000 HVGs in single-cell data. These genes will be used in downstream analysis, like PCA. The resluts are stored in adata[["RNA"]]@var.features.
## caculates hvgs in single-cell data
hvgs = 3000
adata <- FindVariableFeatures(adata,
selection.method = "vst",
nfeatures = hvgs)
# "vst" is a method that model the mean-variance relationship in single-cell data.
# parameter nfeatures is to select top genes
## plot top 3000 HVGs
plot1 <- VariableFeaturePlot(adata)
plot1Next, we apply a linear transformation(“scaling”) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The scaleData function:
adata[["RNA"]]@scale.data# select top 2000 HVGs and perform the scale function
tgenes <- adata[["RNA"]]@var.features
# tgenes <- rownames(adata[["RNA"]])
adata <- ScaleData(adata,
features = tgenes)Brief summary
In the data pre-processing workflow, we do
Next, we perform PCA on the scale.data. By default, only the previously detemined variable features are used as input. we select the top 3000 HVGs as input.
PCA is feature selection techniques whose purpose is to reduce noise, extract features, data compression, increase computing speed and so on. Each PC is linear combination that combines information across a correlated gene set. The top PCs therefore represent a robust comprassion of the dataset. How many components should we choose to include? 10? 20? 100? We can call function ElblowPlot to generates a elbow plot: a ranking of the PCs based on the percentages of variance. In this dataset, we can observe an “elbow” around PC20, suggesting that the majority features are captured in the first 20 PCs.
Seurat applies a graph-based clustering approach. 1. They first construct a NN graph based on the Euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods. Call FindNeighbors() to construct a NN graph. 2. To cluster the cells, they apply the modularity optimization algorithm ( Louvain algorithm, default) to cluster the cells. We can call FindClusters() to do this step. And a parameter resolution is set in the FindClusters(), with increased values leading to a greater number of clusters.
# selcet the first 10 PCs
pcs = 20
# create nn graph
adata <- FindNeighbors(adata,
dims = 1:pcs,
k.param = 20)
# louvain cluster
adata <- FindClusters(adata,
resolution = 0.4)## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4591
## Number of edges: 162469
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9151
## Number of communities: 13
## Elapsed time: 0 seconds
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACCCAAGACAGCTG sc_tutorial 8384 2550 9.148378
## AAACCCAAGTTAACGA sc_tutorial 11768 3017 6.296737
## AAACCCACAGTCGCAC sc_tutorial 6947 2033 5.570750
## AAACCCAGTAGCTTGT sc_tutorial 9354 2193 7.750695
## AAACCCATCTGCCCTA sc_tutorial 8970 2630 7.781494
## AAACGAAAGCAATTAG sc_tutorial 9291 2286 11.634916
## RNA_snn_res.0.4 seurat_clusters
## AAACCCAAGACAGCTG 2 2
## AAACCCAAGTTAACGA 1 1
## AAACCCACAGTCGCAC 0 0
## AAACCCAGTAGCTTGT 8 8
## AAACCCATCTGCCCTA 6 6
## AAACGAAAGCAATTAG 6 6
Seurat offers several non-linear dimensional reduction techniques, such as tSNE and UMAP, to visualize and explore these datasets. The goal of these algorithms is to learn the underlying manifold of the data in order to place similar cells together in low-dimensional space.
RunUMAPIf you haven’t installed UMAP, you can do so via reticulate::py_install(packages = 'umap-learn')
Seurat can help you find markers that define clusters via differential expression.
# find markers for every cluster compared to all remaining cells, report only the positive ones
adata_marker <- FindAllMarkers(adata,
only.pos = TRUE,
min.pct = 0.25,
logfc.threshold = 0.25)DoHeatmap generates an expression heatmap for given cells and features. In this case, we are plotting the top 10 markers (or all markers if less than 10) for each cluster.
# select top 10 markers for each cluster.
top10 <- adata_marker %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
# heatmap the top 10 markers for each cluster.
DoHeatmap(adata,
features = top10$gene,
raster = T,
label = F,
size = 2) +
theme(axis.text.y = element_blank())In this PBMC data, we can use the canonical markers to identify cell types and to mach the cell clusters.
| Cell Clusters | Gene Markers | Cell Types |
|---|---|---|
| 5,6 | MS4A1 | B |
| 4 | GNLY, NKG7 | NK |
| 1 | CD14, LYZ | CD14+ Mono |
| 0,8,9 | IL7R, CCR7 | Naive CD4+T |
| 10,12 | FCER1A, CST3 | DC |
| 7 | MS4A7, FCGR3A | FCGR3A+ Mono |
| 3 | IL7R, S100A4 | Memory CD4+ |
| 2 | CD8A | CD8+ T |
| 11 | PPBP | Platelet |
# B, NK
FeaturePlot(adata,
features = c("MS4A1", # B
"GNLY", "NKG7"), # NK
cols = c("grey", "red", "red"),
ncol = 2)# CD14+Mono, Naive CD4+T
FeaturePlot(adata,
features = c("CD14", "LYZ", # CD14+Mono
"IL7R", "CCR7"), # Naive CD4+T
cols = c("grey", "red", "red"),
ncol = 2)So we can plot the cell types labels on the each cluster
adata[["cell_types"]] <- "undetermined"
adata@meta.data$cell_labels <- as.character(adata@meta.data$seurat_clusters)
celltype_list = list("Naive CD4+ T"=c("0","8","9"),
"Memory CD4+"=c("3"),
"CD14+ Mono"=c("1"),
"B"=c("5","6"),
"CD8+ T"=c("2"),
"FCGR3A+ Mono"=c("7"),
"NK"=c("4"),
"DC"=c("10","12"),
"Platelet"=c("11"))
for(i in names(celltype_list)){
adata@meta.data$cell_types[adata@meta.data$cell_labels %in% celltype_list[[i]]] <- i
}We have to transform the gene SYMBOL to ENTREZID before enrichment
# At first, we select the some cluster to do GO analysis
# 11:Platelet, 2:CD8+ T
marker <- adata_marker[as.character(adata_marker$cluster) %in% c("2","11"), ]
marker$cluster <- as.character(marker$cluster)
marker <- split(marker$gene, marker$cluster)
# we can select the top 100 genes.
# t100 <- adata_marker %>% group_by(cluster) %>% top_n(n = 100, wt = avg_logFC)
# t100 <- split(top20$gene, top20$cluster)
entrezid <- lapply(marker, function(gr) as.numeric(bitr(gr, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db")$ENTREZID))
# PBMC GO
pvalueCutoff = 0.01
qvalueCutoff = 0.01
pbmc_go <- compareCluster(entrezid,
OrgDb='org.Hs.eg.db',
fun='enrichGO',
pvalueCutoff = pvalueCutoff,
qvalueCutoff = qvalueCutoff,
ont = "BP",
readable=T)
# plot the go
dotplot(pbmc_go,
title = paste0("PBMC Gene Ontology (qval < ", qvalueCutoff, ")"))